Data Cleaning and Analysis of Q Cup 2018 Coffee Ratings using Python¶

A project for course "Data Exploration using Python" (STA 4243) at the University of Texas at San Antonio (UTSA)

Participants: Robert Hall, Dulce Ximena Cid Sanabria, Ryan Berberek, Max Moran

Table of Contents¶

  1. Preface and Data Dictionary
  2. Data and Library Importation
  3. Exploratory Data Analysis & Cleaning
  4. Quality Category Questions
    • What palate-related variable has the highest correlation with score?
    • Are there statistically significant diferences in palate-related quantifications with respect to diferent countries of origin?
    • Is there a correlation between altitude and certain taste quantifications?
  5. Location Metadata Questions
    • What countries have the highest overall score?
    • Which countries produce the most amounts of coffee?
    • What location-related variable has the highest correlation with score?
  6. Bean Metadata Questions
    • Which processing method is used the most?
    • Which processing method has the highest number of bags?
    • Did any product get graded the same year it was harvested?
  7. Q Grader Metadata Questions
    • What metrics from Q Graders have any significant effect on score?
    • Is this category useful?

Preface¶

Midterm Project Dataset Choice: Coffee¶

We chose the coffee dataset because of our group’s history with and enjoyment of the drink. We also considered how we might be biased when analyzing other options.

Our group approached the Coffee Quality Union's 2018 dataset with a focus on exploratory analysis to answer our profiling questions from our preliminary report, largely from correlating to final score, but with some correlations relating individual Quality category variables to Location Metadat such as Country or Region. We visualized and described our results to better outline the answers to these questions.

Data Dictionary¶

We separated variables by categories defined in the original "Coffee Quality Database" by James LeDoux, as well as some distinctions we made from our understanding.

Quality Description
Aftertaste The length of positive flavor (taste and aroma) qualities emanating from the back of the palate and remaining after the coffee is expectorated or swallowed. If the aftertaste were short or unpleasant, a lower score would be given.
Aroma The smell of the coffee when infused with hot water.
Acidity Acidity is often described as "brightness" when favorable or "sour" when unfavorable. The final score marked on the horizontal tick-mark scale should reflect the panelist's perceived quality for the Acidity relative to the expected flavor profile based on origin characteristics and/or other factors (degree of roast, intended use, etc.). Coffees expected to be high in Acidity, such as a Kenya coffee, or coffees expected to be low in Acidity, such as a Sumatra coffee, can receive equally high preference scores although their intensity rankings will be quite different.
Body Subfield of flavor. The score given for Flavor should account for the intensity, quality and complexity of its combined taste and aroma, experienced when the coffee is slurped into the mouth vigorously so as to involve the entire palate in the evaluation.
Balance Subfield of flavor. See above.
Clean Cup The lack of interfering negative impressions from first ingestion to final aftertaste, a "transparency" of cup.
Uniformity Uniformity refers to consistency of flavor of the different cups of the sample tasted. If the cups taste different, the rating of this aspect would not be as high.
Sweetness Scalar subfield of flavor. See above.
Moisture Subfield of flavor. Percentage format.
Bean Metadata Description
Processing Method Method for processing the beans.
Color Color of the beans.
Species Either Arabica or Robusta.
Variety Variety of the beans.
Quakers Count of Quakers in the beans sample. Quakers are under-developed coffee beans that can impact the flavor of coffee, as they look the same as regular beans. They can taste dry, ashy, bitter, astringent, and similar to burned popcorn.
Category One Defects "Category 1" defects are considered primary defects that significantly impact the taste of the coffee, like fully black beans, full sour beans, or pods.
Category Two Defects "Category 2" defects are secondary defects that have a less severe impact on taste, including things like broken beans, small sticks, or partial black beans.
# of Bags Number of bags tested.
Bag Weight Bag weight tested.
Grading Date When the beans were graded.
Expiration Date Expiration date of the beans.
Location Metadata Description
Owner Who owns the beans.
Country Where the beans came from.
Region Region where beans came from.
Company Name of the company that managed the beans.
Farm Name of the farm that grew the beans.
Producer Producer of the roasted bean.
Lot Lot number of the tested beans.
Mill Mill where the beans were processed.
Altitude (Low, High, Mean) Altitude low, high, and mean meters.
Q Grader Metadata Description
Certification Body Who certified the sample grade.
ICO Number International Coffee Organization number.
Unit of Measurement Unit of measurement.

Data and Library Importation¶

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import re
coffee = pd.read_csv('coffee_ratings.csv')
In [7]:
coffee.head()
Out[7]:
total_cup_points species owner country_of_origin farm_name lot_number mill ico_number company altitude ... color category_two_defects expiration certification_body certification_address certification_contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 90.58 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
1 89.92 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0
2 89.75 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m ... NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.0 1800.0 1700.0
3 89.00 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 ... Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.0 2200.0 2000.0
4 88.83 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

5 rows × 43 columns

Exploratory Data Analysis¶

In [9]:
coffee.dtypes
Out[9]:
total_cup_points         float64
species                   object
owner                     object
country_of_origin         object
farm_name                 object
lot_number                object
mill                      object
ico_number                object
company                   object
altitude                  object
region                    object
producer                  object
number_of_bags             int64
bag_weight                object
in_country_partner        object
harvest_year              object
grading_date              object
owner_1                   object
variety                   object
processing_method         object
aroma                    float64
flavor                   float64
aftertaste               float64
acidity                  float64
body                     float64
balance                  float64
uniformity               float64
clean_cup                float64
sweetness                float64
cupper_points            float64
moisture                 float64
category_one_defects       int64
quakers                  float64
color                     object
category_two_defects       int64
expiration                object
certification_body        object
certification_address     object
certification_contact     object
unit_of_measurement       object
altitude_low_meters      float64
altitude_high_meters     float64
altitude_mean_meters     float64
dtype: object
In [10]:
coffee.columns
Out[10]:
Index(['total_cup_points', 'species', 'owner', 'country_of_origin',
       'farm_name', 'lot_number', 'mill', 'ico_number', 'company', 'altitude',
       'region', 'producer', 'number_of_bags', 'bag_weight',
       'in_country_partner', 'harvest_year', 'grading_date', 'owner_1',
       'variety', 'processing_method', 'aroma', 'flavor', 'aftertaste',
       'acidity', 'body', 'balance', 'uniformity', 'clean_cup', 'sweetness',
       'cupper_points', 'moisture', 'category_one_defects', 'quakers', 'color',
       'category_two_defects', 'expiration', 'certification_body',
       'certification_address', 'certification_contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')
In [11]:
for i in coffee.columns:
    if pd.api.types.is_numeric_dtype(coffee[str(i)]):
        print(f"Column {str(i)} Maximum: {coffee[str(i)].max()}")
        print(f"Column {str(i)} Minimum: {coffee[str(i)].min()}")
        print('\n')
    else:
        continue
Column total_cup_points Maximum: 90.58
Column total_cup_points Minimum: 0.0


Column number_of_bags Maximum: 1062
Column number_of_bags Minimum: 0


Column aroma Maximum: 8.75
Column aroma Minimum: 0.0


Column flavor Maximum: 8.83
Column flavor Minimum: 0.0


Column aftertaste Maximum: 8.67
Column aftertaste Minimum: 0.0


Column acidity Maximum: 8.75
Column acidity Minimum: 0.0


Column body Maximum: 8.58
Column body Minimum: 0.0


Column balance Maximum: 8.75
Column balance Minimum: 0.0


Column uniformity Maximum: 10.0
Column uniformity Minimum: 0.0


Column clean_cup Maximum: 10.0
Column clean_cup Minimum: 0.0


Column sweetness Maximum: 10.0
Column sweetness Minimum: 0.0


Column cupper_points Maximum: 10.0
Column cupper_points Minimum: 0.0


Column moisture Maximum: 0.28
Column moisture Minimum: 0.0


Column category_one_defects Maximum: 63
Column category_one_defects Minimum: 0


Column quakers Maximum: 11.0
Column quakers Minimum: 0.0


Column category_two_defects Maximum: 55
Column category_two_defects Minimum: 0


Column altitude_low_meters Maximum: 190164.0
Column altitude_low_meters Minimum: 1.0


Column altitude_high_meters Maximum: 190164.0
Column altitude_high_meters Minimum: 1.0


Column altitude_mean_meters Maximum: 190164.0
Column altitude_mean_meters Minimum: 1.0


In [12]:
print(coffee['species'].value_counts())
species
Arabica    1311
Robusta      28
Name: count, dtype: int64

What palate-related variable has the highest correlation with score?¶

Features:¶

  • Aftertaste
  • Aroma
  • Acidity
  • Body
  • Balance
  • Clean Cup
  • Uniformity
  • Sweetness
  • Moisture
In [15]:
coffee.columns
Out[15]:
Index(['total_cup_points', 'species', 'owner', 'country_of_origin',
       'farm_name', 'lot_number', 'mill', 'ico_number', 'company', 'altitude',
       'region', 'producer', 'number_of_bags', 'bag_weight',
       'in_country_partner', 'harvest_year', 'grading_date', 'owner_1',
       'variety', 'processing_method', 'aroma', 'flavor', 'aftertaste',
       'acidity', 'body', 'balance', 'uniformity', 'clean_cup', 'sweetness',
       'cupper_points', 'moisture', 'category_one_defects', 'quakers', 'color',
       'category_two_defects', 'expiration', 'certification_body',
       'certification_address', 'certification_contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')
In [16]:
coffee_salient = coffee[['total_cup_points', 'aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]
In [17]:
coffee_salient.head()
Out[17]:
total_cup_points aftertaste aroma acidity body balance clean_cup uniformity sweetness moisture
0 90.58 8.67 8.67 8.75 8.50 8.42 10.0 10.0 10.0 0.12
1 89.92 8.50 8.75 8.58 8.42 8.42 10.0 10.0 10.0 0.12
2 89.75 8.42 8.42 8.42 8.33 8.42 10.0 10.0 10.0 0.00
3 89.00 8.42 8.17 8.42 8.50 8.25 10.0 10.0 10.0 0.11
4 88.83 8.25 8.25 8.50 8.42 8.33 10.0 10.0 10.0 0.12
In [18]:
coffee_salient.isnull().sum()
Out[18]:
total_cup_points    0
aftertaste          0
aroma               0
acidity             0
body                0
balance             0
clean_cup           0
uniformity          0
sweetness           0
moisture            0
dtype: int64
In [19]:
features = coffee_salient[['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]
labels = coffee_salient[['total_cup_points']]
In [20]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, labels, train_size=0.8, test_size=0.2)
In [21]:
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
model = mlr.fit(x_train, y_train)
points_pred = model.predict(x_test)
In [22]:
print(features.columns)
Index(['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup',
       'uniformity', 'sweetness', 'moisture'],
      dtype='object')
In [23]:
print(model.coef_)
[[ 1.96186763  1.36946563  1.35736291  1.02358272  1.28359187  1.04780508
   1.03838505  0.97678615 -0.09006529]]
In [24]:
coefs = []
for subset in model.coef_:
    for coef in subset:
        coefs.append(round(coef, 4))

print(coefs)

cols = [col for col in features.columns]
print(cols)

feature_coefs = pd.DataFrame({'features': cols,
                              'coefficients': coefs})

feature_coefs.sort_values('coefficients', ascending=False)
feature_coefs
[1.9619, 1.3695, 1.3574, 1.0236, 1.2836, 1.0478, 1.0384, 0.9768, -0.0901]
['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']
Out[24]:
features coefficients
0 aftertaste 1.9619
1 aroma 1.3695
2 acidity 1.3574
3 body 1.0236
4 balance 1.2836
5 clean_cup 1.0478
6 uniformity 1.0384
7 sweetness 0.9768
8 moisture -0.0901

Are there statistically significant diferences in palate-related quantifications with respect to diferent countries of origin?¶

In order to reduce overcomplication, only six nations of the many surveyed were chosen for this study. Countries were not randomly selected, and chosen based on geographical representation.

In [27]:
coffee['country_of_origin'].value_counts()
Out[27]:
country_of_origin
Mexico                          236
Colombia                        183
Guatemala                       181
Brazil                          132
Taiwan                           75
United States (Hawaii)           73
Honduras                         53
Costa Rica                       51
Ethiopia                         44
Tanzania, United Republic Of     40
Uganda                           36
Thailand                         32
Nicaragua                        26
Kenya                            25
El Salvador                      21
Indonesia                        20
China                            16
India                            14
Malawi                           11
United States                    10
Peru                             10
Myanmar                           8
Vietnam                           8
Haiti                             6
Philippines                       5
United States (Puerto Rico)       4
Panama                            4
Ecuador                           3
Laos                              3
Burundi                           2
Papua New Guinea                  1
Rwanda                            1
Zambia                            1
Japan                             1
Mauritius                         1
Cote d?Ivoire                     1
Name: count, dtype: int64
In [28]:
top8 = coffee[coffee['country_of_origin'].isin(['Mexico',  
                                         'Brazil', 
                                         'Taiwan',
                                         'Ethiopia', 
                                         'Tanzania, United Republic Of',
                                         'Indonesia'])]

top8['country_of_origin'].replace('Tanzania, United Republic Of', 'Tanzania')

top8['country_of_origin'].value_counts()
Out[28]:
country_of_origin
Mexico                          236
Brazil                          132
Taiwan                           75
Ethiopia                         44
Tanzania, United Republic Of     40
Indonesia                        20
Name: count, dtype: int64
In [29]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
sig = 0.05 # significance threshold
tukey_results = pairwise_tukeyhsd(top8['aftertaste'], top8['country_of_origin'], sig)
print(tukey_results)
# under the “reject” column, if True, then there is a significant difference; if False, no significant difference.
             Multiple Comparison of Means - Tukey HSD, FWER=0.05             
=============================================================================
  group1             group2            meandiff p-adj   lower   upper  reject
-----------------------------------------------------------------------------
   Brazil                     Ethiopia   0.4533    0.0  0.3038  0.6028   True
   Brazil                    Indonesia   -0.025 0.9993 -0.2311   0.181  False
   Brazil                       Mexico  -0.2125    0.0 -0.3058 -0.1191   True
   Brazil                       Taiwan  -0.0451  0.905 -0.1692  0.0791  False
   Brazil Tanzania, United Republic Of  -0.0155 0.9997 -0.1705  0.1395  False
 Ethiopia                    Indonesia  -0.4784    0.0 -0.7099 -0.2468   True
 Ethiopia                       Mexico  -0.6658    0.0 -0.8068 -0.5248   True
 Ethiopia                       Taiwan  -0.4984    0.0 -0.6615 -0.3353   True
 Ethiopia Tanzania, United Republic Of  -0.4689    0.0 -0.6565 -0.2813   True
Indonesia                       Mexico  -0.1874 0.0808 -0.3874  0.0125  False
Indonesia                       Taiwan    -0.02 0.9998 -0.2361  0.1961  False
Indonesia Tanzania, United Republic Of   0.0095    1.0 -0.2257  0.2447  False
   Mexico                       Taiwan   0.1674 0.0004  0.0536  0.2812   True
   Mexico Tanzania, United Republic Of   0.1969 0.0019  0.0501  0.3438   True
   Taiwan Tanzania, United Republic Of   0.0295 0.9961 -0.1386  0.1977  False
-----------------------------------------------------------------------------
In [30]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
sig = 0.05 # significance threshold
tukey_results = pairwise_tukeyhsd(top8['acidity'], top8['country_of_origin'], sig)
print(tukey_results)
# under the “reject” column, if True, then there is a significant difference; if False, no significant difference.
             Multiple Comparison of Means - Tukey HSD, FWER=0.05             
=============================================================================
  group1             group2            meandiff p-adj   lower   upper  reject
-----------------------------------------------------------------------------
   Brazil                     Ethiopia   0.5322    0.0  0.3957  0.6687   True
   Brazil                    Indonesia  -0.0204 0.9996 -0.2087  0.1678  False
   Brazil                       Mexico  -0.0831  0.061 -0.1683  0.0022  False
   Brazil                       Taiwan   -0.096 0.1511 -0.2094  0.0174  False
   Brazil Tanzania, United Republic Of  -0.0114 0.9999  -0.153  0.1301  False
 Ethiopia                    Indonesia  -0.5526    0.0 -0.7642 -0.3411   True
 Ethiopia                       Mexico  -0.6153    0.0 -0.7441 -0.4865   True
 Ethiopia                       Taiwan  -0.6282    0.0 -0.7771 -0.4792   True
 Ethiopia Tanzania, United Republic Of  -0.5436    0.0  -0.715 -0.3723   True
Indonesia                       Mexico  -0.0627 0.9239 -0.2453    0.12  False
Indonesia                       Taiwan  -0.0755 0.8836 -0.2729  0.1219  False
Indonesia Tanzania, United Republic Of    0.009    1.0 -0.2058  0.2238  False
   Mexico                       Taiwan  -0.0129 0.9993 -0.1169  0.0911  False
   Mexico Tanzania, United Republic Of   0.0717 0.6464 -0.0625  0.2058  False
   Taiwan Tanzania, United Republic Of   0.0845  0.616  -0.069  0.2381  False
-----------------------------------------------------------------------------
In [31]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
sig = 0.05 # significance threshold
tukey_results = pairwise_tukeyhsd(top8['aftertaste'], top8['country_of_origin'], sig)
print(tukey_results)
# under the “reject” column, if True, then there is a significant difference; if False, no significant difference.
             Multiple Comparison of Means - Tukey HSD, FWER=0.05             
=============================================================================
  group1             group2            meandiff p-adj   lower   upper  reject
-----------------------------------------------------------------------------
   Brazil                     Ethiopia   0.4533    0.0  0.3038  0.6028   True
   Brazil                    Indonesia   -0.025 0.9993 -0.2311   0.181  False
   Brazil                       Mexico  -0.2125    0.0 -0.3058 -0.1191   True
   Brazil                       Taiwan  -0.0451  0.905 -0.1692  0.0791  False
   Brazil Tanzania, United Republic Of  -0.0155 0.9997 -0.1705  0.1395  False
 Ethiopia                    Indonesia  -0.4784    0.0 -0.7099 -0.2468   True
 Ethiopia                       Mexico  -0.6658    0.0 -0.8068 -0.5248   True
 Ethiopia                       Taiwan  -0.4984    0.0 -0.6615 -0.3353   True
 Ethiopia Tanzania, United Republic Of  -0.4689    0.0 -0.6565 -0.2813   True
Indonesia                       Mexico  -0.1874 0.0808 -0.3874  0.0125  False
Indonesia                       Taiwan    -0.02 0.9998 -0.2361  0.1961  False
Indonesia Tanzania, United Republic Of   0.0095    1.0 -0.2257  0.2447  False
   Mexico                       Taiwan   0.1674 0.0004  0.0536  0.2812   True
   Mexico Tanzania, United Republic Of   0.1969 0.0019  0.0501  0.3438   True
   Taiwan Tanzania, United Republic Of   0.0295 0.9961 -0.1386  0.1977  False
-----------------------------------------------------------------------------

Do there exist correlations between altitude and certain taste quantifications?¶

Data on altitude was extremely messy -- lots of variation in hyphens, where units of measurement were placed, etc. All measurements seem to be consistently in meters. Each instance of the new altitude column is the minimum altitude, both for ease in data cleaning, and to keep the estimates conservative.

In [34]:
coffee.dropna(subset=['altitude'])
coffee['altitude'] = coffee['altitude'].str[:4]
coffee['altitude'] = coffee['altitude'].apply(lambda x: str(x)[:4] if str(x)[:4].isdigit() else None)
coffee['altitude'] = pd.to_numeric(coffee['altitude'], errors='coerce').astype('Int64')
coffee['altitude'].head()
Out[34]:
0    1950
1    1950
2    1600
3    1800
4    1950
Name: altitude, dtype: Int64
In [35]:
coffee_salient = coffee[['altitude', 'aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]
feature = coffee['altitude']
labels = coffee[['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]

Since data has missing values, we will fill the 'altitude' column using the mean of the altitude.

In [37]:
from sklearn.impute import SimpleImputer
import numpy as np

x_train, x_test, y_train, y_test = train_test_split(feature, labels, test_size=0.2, train_size=0.8, random_state=27)

imputer = SimpleImputer(strategy='mean')

x_train = np.array(x_train).reshape(-1,1)
x_train = imputer.fit_transform(x_train)

x_test = np.array(x_test).reshape(-1,1)
x_test = imputer.fit_transform(x_test)
In [38]:
mlr = LinearRegression()
model = mlr.fit(x_train, y_train)
points_pred = model.predict(x_test)
In [39]:
labels.columns
Out[39]:
Index(['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup',
       'uniformity', 'sweetness', 'moisture'],
      dtype='object')
In [40]:
model.coef_
Out[40]:
array([[-2.12529437e-05],
       [ 1.66793663e-05],
       [ 3.94667029e-05],
       [-7.61694694e-06],
       [ 8.81035253e-08],
       [ 3.20848528e-05],
       [ 7.51252524e-06],
       [-1.67944937e-06],
       [ 1.29207158e-06]])
In [41]:
coefs = []
for subset in model.coef_:
    for coef in subset:
        coefs.append(coef)

#print(coefs)

cols = [col for col in features.columns]
#print(cols)

feature_coefs = pd.DataFrame({'features': cols,
                              'coefficients': coefs})

feature_coefs.sort_values('coefficients', ascending=False)
feature_coefs
Out[41]:
features coefficients
0 aftertaste -2.125294e-05
1 aroma 1.667937e-05
2 acidity 3.946670e-05
3 body -7.616947e-06
4 balance 8.810353e-08
5 clean_cup 3.208485e-05
6 uniformity 7.512525e-06
7 sweetness -1.679449e-06
8 moisture 1.292072e-06

Overall Takeaways and Answers¶

What palate-related variable has the highest correlation with score?¶

  • 'aftertaste' has the highest correlation with score. The scikit-learn linear regression coefficient score for aftertaste is 1.967.

Are there statistically significant diferences in taste quantifications with respect to diferent countries of origin?¶

Statistically significant differences were found between the scores of the following nations under the forementioned categories:

(Bold text indicates the nation(s) with significantly higher scores than their counterpart nations in their respective categories)

aftertaste:

  • Brazil and Ethiopia
  • Brazil and Mexico
  • Ethiopia and Indonesia, Mexico, Taiwan and Tanzania
  • Mexico and Taiwan, Tanzania

acidity:

  • Brazil and Ethiopia
  • Ethiopia and Indonesia, Mexico, Taiwan and Tanzania

aftertaste:

  • Brazil and Ethiopia
  • Brazil and Mexico
  • Ethiopia and Indonesia, Mexico, Taiwan and Tanzania
  • Mexico and Taiwan, Tanzania

A significance threshold of a = 0.05 was used. ANOVA using a Tukey range test was applied to achieve the p-scores.

Do there exist strong correlations between altitude and certain taste quantifications?¶

  • The coefficients outputted by the model are quite small, likely due to a lack of feature scaling or incidence of multicollinearity, not within the scope of the course for which this project is created.

  • In relative terms, altitude puts downward pressure on the scores of 'aftertaste' and 'body', and places upward pressure on balance, uniformity, acidity, and the propensity for the coffee cup to be clean after drinking to completion.

Location Metadata Questions¶

In [47]:
coffee.rename(columns={'owner': 'Owner', 'producer': 'Producer', 'country_of_origin': 'Country', 'region': 'Region', 'company': 'Company',
    'farm_name': 'Farm', 'lot': 'Lot', 'mill': 'Mill', 'altitude_mean_meters': 'Altitude (mean)','total_cup_points':'Total Cup Points', 
    'number_of_bags':'Number of Bags','altitude_low_meters':'Altitude (low)', 'altitude_high_meters':'Altitude (high)'}, inplace=True)
In [48]:
## What countries have the highest overall score?
In [49]:
# Group the data by country and calculate the mean of Total Cup Points
country_points = coffee.groupby('Country')['Total Cup Points'].mean().reset_index()

# Create a Choropleth map using Plotly
map = px.choropleth(
    country_points,
    locations='Country',
    locationmode='country names',
    color='Total Cup Points',
    hover_name='Country',
    color_continuous_scale=px.colors.diverging.Spectral,
    labels={'Total Cup Points': 'Average Cup Points'})

# Show the choropleth map
map.show()
In [50]:
coffee.groupby('Country')['Total Cup Points'].mean().reset_index().sort_values(by='Total Cup Points', ascending=False)
Out[50]:
Country Total Cup Points
23 Papua New Guinea 85.750000
8 Ethiopia 85.484091
14 Japan 84.670000
31 United States 84.433000
15 Kenya 84.309600
22 Panama 83.707500
30 Uganda 83.451944
3 Colombia 83.106557
7 El Salvador 83.052857
2 China 82.927500
26 Rwanda 82.830000
4 Costa Rica 82.789020
29 Thailand 82.573750
13 Indonesia 82.565500
24 Peru 82.526000
0 Brazil 82.405909
28 Tanzania, United Republic Of 82.369500
27 Taiwan 82.001333
35 Zambia 81.920000
9 Guatemala 81.846575
16 Laos 81.833333
1 Burundi 81.830000
32 United States (Hawaii) 81.820411
33 United States (Puerto Rico) 81.727500
17 Malawi 81.711818
34 Vietnam 81.208750
12 India 81.082857
19 Mexico 80.890085
25 Philippines 80.834000
20 Myanmar 80.750000
18 Mauritius 80.500000
21 Nicaragua 80.458077
6 Ecuador 80.220000
11 Honduras 79.357547
5 Cote d?Ivoire 79.330000
10 Haiti 77.180000

The Country of Origin that has the highest overral score is Papua New Guinea with 85.75 points. On second place we have Ethiopia with 85.48 points and third place is Japan with 84.67 points. By the other hand, the Haiti has the lowest overral score from the coffee database.

description

Which countries produce the most amounts of coffee? (country, number of bags)¶

In [54]:
coffee.groupby('Country')['Number of Bags'].sum().reset_index().sort_values(by='Number of Bags', ascending=False)
Out[54]:
Country Number of Bags
3 Colombia 41204
9 Guatemala 36868
0 Brazil 30534
19 Mexico 24140
11 Honduras 13167
8 Ethiopia 11761
4 Costa Rica 10354
21 Nicaragua 6406
30 Uganda 5477
7 El Salvador 4449
15 Kenya 3971
28 Tanzania, United Republic Of 3760
12 India 3011
24 Peru 2336
27 Taiwan 1914
13 Indonesia 1658
29 Thailand 1310
32 United States (Hawaii) 833
17 Malawi 557
22 Panama 537
1 Burundi 520
31 United States 462
10 Haiti 390
25 Philippines 259
26 Rwanda 150
16 Laos 81
33 United States (Puerto Rico) 71
2 China 55
14 Japan 20
35 Zambia 13
34 Vietnam 11
20 Myanmar 10
23 Papua New Guinea 7
6 Ecuador 3
5 Cote d?Ivoire 2
18 Mauritius 1

The country that produces the most amount of coffee is Colombia with 41,205 number of bags. On second place we have Guatemala with 85.48 points and third place is Brazil with 84.67 points. By the other hand, the Haiti has the lowest overral score from the coffee database.

description

What location-related variable has the highest correlation with score?¶

In [58]:
numeric_data = coffee[['Altitude (mean)', 'Altitude (low)', 'Altitude (high)'] + ['Total Cup Points']].dropna()
location_correlation = numeric_data.corr().drop('Total Cup Points')

# Correlation heatmap
sns.heatmap(location_correlation, annot=True, cmap='coolwarm')

plt.title('Correlation Map (Location variables)')
plt.show()
No description has been provided for this image

If we analyze the data in general, the correlation between the altitude and the total cup points indicates that there is almost no relation between these variables, but we want to get the correlation divided by the country of origin.

In [60]:
# Group by Country the correlation with the mean altitude
country_altitude_corr = coffee.groupby('Country').apply(
    lambda group: group[['Altitude (mean)', 'Total Cup Points']].corr().iloc[0, 1]).dropna().sort_values(ascending=False).reset_index()

country_altitude_corr.rename(columns={0:'Correlation'})
C:\Users\berbr\AppData\Local\Temp\ipykernel_15628\895299115.py:2: DeprecationWarning:

DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.

Out[60]:
Country Correlation
0 United States (Hawaii) 1.000000
1 Peru 1.000000
2 Ecuador 1.000000
3 Burundi 1.000000
4 Haiti 0.742130
5 Philippines 0.726317
6 India 0.587434
7 Uganda 0.478961
8 Ethiopia 0.415867
9 Taiwan 0.399207
10 Malawi 0.399116
11 Thailand 0.381836
12 United States 0.274804
13 Costa Rica 0.235791
14 Panama 0.191214
15 El Salvador 0.125397
16 Kenya 0.112667
17 Honduras 0.048521
18 Mexico 0.042369
19 Nicaragua 0.037648
20 Brazil 0.027320
21 Colombia -0.006585
22 Guatemala -0.053725
23 Tanzania, United Republic Of -0.080831
24 China -0.107649
25 Vietnam -0.141472
26 Myanmar -0.150707
27 Laos -0.154993
28 Indonesia -0.565402

Altitude tends to have a more effect on coffee quality in certain regions (for example, higher altitudes in Ethiopia or Colombia), but this relationship may vary by location.

Which farms, mills and producers have the most consistent coffee quality over time?¶

In [63]:
# Calculate the standard deviation of 'total_cup_points' for each farm
farm_consistency = coffee.groupby('Farm')['Total Cup Points'].std().dropna().sort_values().reset_index()
farm_consistency.head()
Out[63]:
Farm Total Cup Points
0 gran manzana y el aguacate 0.113137
1 el barbaro 0.113137
2 gao chun fang 高醇坊 0.120208
3 coopetarrazú 0.120208
4 la orduña, coatepec, veracruz 0.120208

The farms that are most consistent with the coffee quality they produce are Gran Manzana y El Aguacate and El Barbaro, both with the lowest standar deviation of 0.113137.

In [65]:
# Calculate the standard deviation of 'total_cup_points' for each mill
mill_consistency = coffee.groupby('Mill')['Total Cup Points'].std().dropna().sort_values().reset_index()
mill_consistency.head()
Out[65]:
Mill Total Cup Points
0 ibero 0.000000
1 el olvido, zentla, veracruz 0.000000
2 internacional armazens gerais 0.056569
3 17/18 0.063640
4 santa adelaida 0.113137

The mills that are most consistent with the coffee quality they produce are Ibero and El Olvido, Zentla, Veracruz, both with the lowest standar deviation of 0.

In [67]:
# Calculate the standard deviation of 'total_cup_points' for each producer
producer_consistency = coffee.groupby('Producer')['Total Cup Points'].std().dropna().sort_values().reset_index()
producer_consistency.head()
Out[67]:
Producer Total Cup Points
0 Santos Fonseca 0.000000
1 Maria Rogeria Costa Pereira 0.000000
2 Assefa Belay Coffee Producer 0.056569
3 COFFEE FOR PEACE,INC. 0.113137
4 El Barbaro, S.A. de C.V. 0.113137

The producers that are most consistent with the coffee quality they produce are Santos Fonseca and Maria Rogeria Costa Pereira, both with the lowest standar deviation of 0.

These farms, mills and producers with the lowest standard deviation indicate stable growing conditions, more reliable practices, and consistent quality coffee.

Bean Metadata Questions¶

Which processing method is used the most?¶

In [71]:
pm_counts = coffee['processing_method'].value_counts().reset_index()
print(pm_counts)
pm_bar = px.bar(pm_counts, x=pm_counts['processing_method'], y=pm_counts['count'])
pm_bar.show()
           processing_method  count
0               Washed / Wet    815
1              Natural / Dry    258
2  Semi-washed / Semi-pulped     56
3                      Other     26
4     Pulped natural / honey     14

In broad percentages, Wet processing makes up 69.7%, Dry makes up just over 22%, Semi-Wet makes up >5%, Other makes up >2.5%, and Honey makes up just over 1%.

Which processing method has the highest number of bags?¶

In [74]:
pm_bags = coffee.groupby('processing_method')['Number of Bags'].sum().reset_index().sort_values(by='Number of Bags', ascending=False)
print(pm_bags)
pm_bags_bar = px.bar(pm_bags, x=pm_bags['processing_method'], y=pm_bags['Number of Bags'])
pm_bags_bar.show()
           processing_method  Number of Bags
4               Washed / Wet          128071
0              Natural / Dry           40581
3  Semi-washed / Semi-pulped            5304
1                      Other            3342
2     Pulped natural / honey            2341

Wet processing method makes up over 71% of the individual bags processed, with Dry coming second at 22.5%. The remaining methods make up only 6.5% of bags.

Did any product get graded the same year it was harvested?¶

In [77]:
# Clean grading_date column which is by default un-parseable for to_datetime
def remove_date_suffix(column):
    return re.sub(r"\b([0123]?[0-9])(st|th|nd|rd)\b",r"\1", column)
coffee['grading_date'] = coffee['grading_date'].apply(remove_date_suffix)
coffee['grading_date'] = pd.to_datetime(coffee['grading_date'], format='mixed')
In [78]:
# Create and convert year to object dtype
coffee['year'] = coffee['grading_date'].dt.year
coffee['year'] = coffee['year'].astype(str)
In [79]:
# Preliminary analysis of harvest vs graded year.
coffee['harvest_vs_grading_years'] = coffee['harvest_year'].equals(coffee['year'])
year_comparison = coffee['harvest_vs_grading_years'].value_counts().reset_index()
print(year_comparison)
   harvest_vs_grading_years  count
0                     False   1339

Without deep-cleaning the messy harvest_year column, there are no immediate direct matches between the extracted grading year and harvest year. It is likely that values in nonstandard formats represent the observations with matching harvest and grading years. Further data cleaning is needed.

Q Grader Metadata Questions¶

Clean Data¶

Check for Duplicates¶

In [83]:
duplicated_coffee = coffee[coffee.duplicated()]
print("Duplicate rows: ", duplicated_coffee.shape)
Duplicate rows:  (0, 45)

Grab relevant columns¶

In [85]:
coffee.rename(columns={'Total Cup Points': 'total_cup_points'}, inplace=True)
qBasedCoffee = coffee[['total_cup_points','certification_body']]
qBasedCoffee.head()
Out[85]:
total_cup_points certification_body
0 90.58 METAD Agricultural Development plc
1 89.92 METAD Agricultural Development plc
2 89.75 Specialty Coffee Association
3 89.00 METAD Agricultural Development plc
4 88.83 METAD Agricultural Development plc

Visualization¶

Let's display a basic chart to better visualize the distribution of certification bodies and total cup points.

In [87]:
plt.figure(figsize=(20, 6))
plt.bar(qBasedCoffee['certification_body'], qBasedCoffee['total_cup_points'], width = 0.4) 
plt.xlabel("Certification Bodies")
plt.ylabel("Total Cup Points")
plt.title("Average Cup Points per Certification Body")
ax1 = plt.subplot()
certBodies = qBasedCoffee.certification_body.unique()
ax1.set_xticklabels(certBodies, rotation=90)
plt.show()
C:\Users\berbr\AppData\Local\Temp\ipykernel_15628\2269462007.py:8: UserWarning:

set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.

C:\Users\berbr\AppData\Roaming\Python\Python312\site-packages\IPython\core\pylabtools.py:170: UserWarning:

) missing from font(s) DejaVu Sans.

No description has been provided for this image

Visual Inspection¶

You can see perhaps a slight downwards slope but nothing significant. Let's check by running a linear regression model.

In [89]:
### Basic Linear Model
from statsmodels.formula.api import ols

fit = ols('total_cup_points ~ C(certification_body)', data=qBasedCoffee).fit() 
fit.summary()
Out[89]:
OLS Regression Results
Dep. Variable: total_cup_points R-squared: 0.122
Model: OLS Adj. R-squared: 0.105
Method: Least Squares F-statistic: 7.277
Date: Tue, 15 Oct 2024 Prob (F-statistic): 7.08e-24
Time: 01:30:20 Log-Likelihood: -3490.3
No. Observations: 1339 AIC: 7033.
Df Residuals: 1313 BIC: 7168.
Df Model: 25
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 80.8227 0.231 349.424 0.000 80.369 81.276
C(certification_body)[T.Africa Fine Coffee Association] 1.5265 0.527 2.899 0.004 0.493 2.560
C(certification_body)[T.Almacafé] 2.5233 0.339 7.437 0.000 1.858 3.189
C(certification_body)[T.Asociacion Nacional Del Café] 0.8298 0.353 2.354 0.019 0.138 1.521
C(certification_body)[T.Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C.] 1.8573 1.372 1.354 0.176 -0.834 4.548
C(certification_body)[T.Asociación de Cafés Especiales de Nicaragua] -0.6465 1.194 -0.542 0.588 -2.988 1.695
C(certification_body)[T.Blossom Valley International] 1.3947 0.493 2.832 0.005 0.428 2.361
C(certification_body)[T.Blossom Valley International ] -0.4027 3.320 -0.121 0.903 -6.915 6.110
C(certification_body)[T.Brazil Specialty Coffee Association] 0.9834 0.466 2.110 0.035 0.069 1.898
C(certification_body)[T.Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao] 2.3473 3.320 0.707 0.480 -4.165 8.860
C(certification_body)[T.Centro Agroecológico del Café A.C.] 1.5223 1.194 1.275 0.202 -0.819 3.864
C(certification_body)[T.Coffee Quality Institute] -0.0013 1.273 -0.001 0.999 -2.498 2.496
C(certification_body)[T.Ethiopia Commodity Exchange] 4.3573 0.814 5.352 0.000 2.760 5.954
C(certification_body)[T.Instituto Hondureño del Café] -1.6487 0.486 -3.392 0.001 -2.602 -0.695
C(certification_body)[T.Kenya Coffee Traders Association] 2.9491 0.743 3.969 0.000 1.492 4.407
C(certification_body)[T.METAD Agricultural Development plc] 5.8433 0.886 6.596 0.000 4.105 7.581
C(certification_body)[T.NUCOFFEE] 2.5617 0.598 4.280 0.000 1.388 3.736
C(certification_body)[T.Salvadoran Coffee Council] 1.6564 1.025 1.616 0.106 -0.354 3.667
C(certification_body)[T.Specialty Coffee Association] 1.1459 0.298 3.851 0.000 0.562 1.730
C(certification_body)[T.Specialty Coffee Association of Costa Rica] 1.7159 0.555 3.089 0.002 0.626 2.806
C(certification_body)[T.Specialty Coffee Association of Indonesia] 1.7823 1.073 1.662 0.097 -0.322 3.886
C(certification_body)[T.Specialty Coffee Institute of Asia] 3.5160 0.860 4.090 0.000 1.830 5.202
C(certification_body)[T.Tanzanian Coffee Board] 1.6923 1.372 1.234 0.218 -0.999 4.383
C(certification_body)[T.Torch Coffee Lab Yunnan] 1.9273 2.353 0.819 0.413 -2.689 6.544
C(certification_body)[T.Uganda Coffee Development Authority] 2.4641 0.629 3.915 0.000 1.229 3.699
C(certification_body)[T.Yunnan Coffee Exchange] 2.9481 0.984 2.997 0.003 1.018 4.878
Omnibus: 2321.520 Durbin-Watson: 0.866
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3642897.434
Skew: -11.258 Prob(JB): 0.00
Kurtosis: 257.534 Cond. No. 38.7


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Preliminary Anaylsis¶

With an R^2 value of only 0.122, there is little evidence to suggest the certificaiton body is significant in predicting the total_cup_points.

Narrowing Categories¶

Let's see if we only use the top few certification bodies we can find some sort of difference.

In [92]:
certFreq = qBasedCoffee['certification_body'].value_counts()
certFreq = certFreq.to_frame()
certFreq
Out[92]:
count
certification_body
Specialty Coffee Association 313
AMECAFE 205
Almacafé 178
Asociacion Nacional Del Café 155
Brazil Specialty Coffee Association 67
Instituto Hondureño del Café 60
Blossom Valley International 58
Africa Fine Coffee Association 49
Specialty Coffee Association of Costa Rica 43
NUCOFFEE 36
Uganda Coffee Development Authority 32
Kenya Coffee Traders Association 22
Ethiopia Commodity Exchange 18
Specialty Coffee Institute of Asia 16
METAD Agricultural Development plc 15
Yunnan Coffee Exchange 12
Salvadoran Coffee Council 11
Specialty Coffee Association of Indonesia 10
Centro Agroecológico del Café A.C. 8
Asociación de Cafés Especiales de Nicaragua 8
Coffee Quality Institute 7
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6
Tanzanian Coffee Board 6
Torch Coffee Lab Yunnan 2
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1
Blossom Valley International\r\n 1
In [93]:
topBodies = ('Specialty Coffee Association','AMECAFE','Almacafé','Asociacion Nacional Del Café')
qBasedCoffeeNarrowed = qBasedCoffee.loc[qBasedCoffee['certification_body'].isin(topBodies)]
print(qBasedCoffeeNarrowed)
      total_cup_points            certification_body
2                89.75  Specialty Coffee Association
11               87.92                      Almacafé
12               87.92                      Almacafé
13               87.92  Specialty Coffee Association
15               87.58                      Almacafé
...                ...                           ...
1334             78.75  Specialty Coffee Association
1335             78.08  Specialty Coffee Association
1336             77.17  Specialty Coffee Association
1337             75.08  Specialty Coffee Association
1338             73.75  Specialty Coffee Association

[851 rows x 2 columns]
In [94]:
from statsmodels.formula.api import ols

fit = ols('total_cup_points ~ C(certification_body)', data=qBasedCoffeeNarrowed).fit() 
fit.summary()
Out[94]:
OLS Regression Results
Dep. Variable: total_cup_points R-squared: 0.099
Model: OLS Adj. R-squared: 0.096
Method: Least Squares F-statistic: 31.05
Date: Tue, 15 Oct 2024 Prob (F-statistic): 4.75e-19
Time: 01:30:20 Log-Likelihood: -2012.4
No. Observations: 851 AIC: 4033.
Df Residuals: 847 BIC: 4052.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 80.8227 0.180 448.349 0.000 80.469 81.177
C(certification_body)[T.Almacafé] 2.5233 0.264 9.542 0.000 2.004 3.042
C(certification_body)[T.Asociacion Nacional Del Café] 0.8298 0.275 3.020 0.003 0.291 1.369
C(certification_body)[T.Specialty Coffee Association] 1.1459 0.232 4.941 0.000 0.691 1.601
Omnibus: 381.208 Durbin-Watson: 0.262
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3180.077
Skew: -1.833 Prob(JB): 0.00
Kurtosis: 11.732 Cond. No. 4.98


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Further Analysis¶

After narrowing the certification bodies analyzed, the linear model output an even worse R^2 value of 0.099. This leads us to believe that this variable is counterproductive to our anaylsis.

Answering the Research Questions¶

  1. What metrics from Q Graders have any significant effect on score?
  • certification_body, certification_address, and certification_contact are the only variables related to Q Graders we are given in this data set. The only useful variable is the first one, because the second and third are essentially just identifying numbers for each grader. After some basic linear regression model anaylsis, the certification_body doesn't seem to be helpful.
  1. Is this category useful?
  • This category does not seem to be usefull without further information on the Q Graders themselves. We suggest ommitting the variable and focusing elsewhere.